[1] 1599 12
[1] "fixed.acidity" "volatile.acidity" "citric.acid"
[4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
[7] "total.sulfur.dioxide" "density" "pH"
[10] "sulphates" "alcohol" "quality"
'data.frame': 1599 obs. of 12 variables:
$ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
$ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
$ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
$ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
$ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
$ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
$ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
$ density : num 0.998 0.997 0.997 0.998 0.998 ...
$ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
$ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
$ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
$ quality : int 5 5 5 6 5 5 5 7 7 5 ...
fixed.acidity volatile.acidity citric.acid residual.sugar
Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
chlorides free.sulfur.dioxide total.sulfur.dioxide
Min. :0.01200 Min. : 1.00 Min. : 6.00
1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
Median :0.07900 Median :14.00 Median : 38.00
Mean :0.08747 Mean :15.87 Mean : 46.47
3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
Max. :0.61100 Max. :72.00 Max. :289.00
density pH sulphates alcohol
Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
quality
Min. :3.000
1st Qu.:5.000
Median :6.000
Mean :5.636
3rd Qu.:6.000
Max. :8.000
Most red wines have fixed acidity between 7.10 g/dm^3 and 9.20 g/dm^3.
Most red wines have volatile acidity between 0.39 g/dm^3 and 0.64 g/dm^3. There are some outliers above 1.5 g/dm^3.
Citric acid has three peaks, around 0, 0.25 and 0.5 g/dm^3.
FALSE TRUE
1467 132
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.090 0.260 0.271 0.420 1.000
0 0.49 0.24 0.02 0.26 0.1 0.01 0.08 0.21 0.32 0.03 0.09 0.3 0.31 0.04
132 68 51 50 38 35 33 33 33 32 30 30 30 30 29
0.4 0.42 0.39 0.12 0.22 0.25 0.2 0.23 0.33 0.06 0.34 0.44 0.48 0.07 0.18
29 29 28 27 27 27 25 25 25 24 24 23 23 22 22
0.45 0.14 0.19 0.29 0.05 0.27 0.36 0.5 0.15 0.28 0.37 0.46 0.13 0.47 0.52
22 21 21 21 20 20 20 20 19 19 19 19 18 18 17
0.17 0.41 0.11 0.43 0.38 0.53 0.66 0.35 0.51 0.54 0.55 0.68 0.63 0.16 0.57
16 16 15 15 14 14 14 13 13 13 12 11 10 9 9
0.58 0.6 0.64 0.56 0.59 0.65 0.69 0.74 0.73 0.76 0.61 0.67 0.7 0.62 0.71
9 9 9 8 8 7 4 4 3 3 2 2 2 1 1
0.72 0.75 0.78 0.79 1
1 1 1 1 1
About 8% of red wines (132 of 1599) have no citric acid. There is one outlier at 1.0 g/dm^3.
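The share of wines with no citric acid can be recomputed from the FALSE/TRUE counts printed above. This is only a small sketch; the names `zero_citric` and `pct_zero` are chosen here for illustration and are not from the original analysis.

```r
# Counts copied from the table of (citric.acid == 0) shown above.
zero_citric <- c("FALSE" = 1467, "TRUE" = 132)

# Share of wines with no citric acid, as a percentage of all 1599 wines.
pct_zero <- 100 * zero_citric[["TRUE"]] / sum(zero_citric)
round(pct_zero, 1)
```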
I adjusted the x-axis to exclude the outlier and changed the bin width for better visualization.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.900 1.900 2.200 2.539 2.600 15.500
The histogram of residual sugar has a single peak and a long right tail. Most red wines have residual sugar between 1.9 g/dm^3 and 2.6 g/dm^3 (median 2.2 g/dm^3, mean 2.539 g/dm^3).
I transformed the x-axis with log10() for better visualization.
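To see why a log10 scale helps here, one rough check is to compare how lopsided the tails are before and after the transform, using only the summary values printed above. The helper `tail_ratio` is a hypothetical name introduced for this sketch, not part of the original code.

```r
# Five-number summary of residual.sugar, copied from the output above (g/dm^3).
rs <- c(min = 0.9, q1 = 1.9, median = 2.2, q3 = 2.6, max = 15.5)

# Length of the upper tail divided by the length of the lower tail,
# both measured from the median. A value near 1 means roughly symmetric.
tail_ratio <- function(s) {
  unname((s["max"] - s["median"]) / (s["median"] - s["min"]))
}

raw_ratio <- tail_ratio(rs)         # on the original scale
log_ratio <- tail_ratio(log10(rs))  # after the log10 transform
c(raw = raw_ratio, log10 = log_ratio)
```

The ratio is much closer to 1 on the log10 scale, which is why the long tail is easier to read after the transform.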
Chlorides seem to have some outliers.
0.08 0.074 0.076 0.078 0.084 0.071 0.077 0.082 0.075 0.079 0.081 0.07
66 55 51 51 49 47 47 46 45 43 40 35
0.073 0.083 0.066 0.088 0.086 0.068 0.067 0.085 0.087 0.089 0.062 0.072
35 35 32 32 31 30 27 25 25 25 24 24
0.065 0.095 0.063 0.092 0.069 0.09 0.093 0.064 0.091 0.094 0.096 0.097
23 23 22 22 21 21 21 20 19 19 18 18
0.059 0.06 0.104 0.058 0.054 0.1 0.05 0.098 0.061 0.114 0.052 0.057
17 16 16 14 13 13 12 12 11 11 10 10
0.102 0.056 0.107 0.048 0.049 0.055 0.099 0.106 0.11 0.118 0.103 0.111
10 9 9 8 8 8 8 8 8 8 7 7
0.122 0.105 0.112 0.123 0.044 0.053 0.101 0.115 0.039 0.041 0.045 0.046
7 6 6 6 5 5 5 5 4 4 4 4
0.047 0.117 0.132 0.042 0.109 0.119 0.12 0.124 0.157 0.166 0.214 0.415
4 4 4 3 3 3 3 3 3 3 3 3
0.012 0.038 0.116 0.121 0.152 0.171 0.178 0.205 0.226 0.414 0.034 0.043
2 2 2 2 2 2 2 2 2 2 1 1
0.051 0.108 0.113 0.125 0.126 0.127 0.128 0.136 0.137 0.143 0.145 0.146
1 1 1 1 1 1 1 1 1 1 1 1
0.147 0.148 0.153 0.159 0.161 0.165 0.168 0.169 0.17 0.172 0.174 0.176
1 1 1 1 1 1 1 1 1 1 1 1
0.186 0.19 0.194 0.2 0.213 0.216 0.222 0.23 0.235 0.236 0.241 0.243
1 1 1 1 1 1 1 1 1 1 1 1
0.25 0.263 0.267 0.27 0.332 0.337 0.341 0.343 0.358 0.36 0.368 0.369
1 1 1 1 1 1 1 1 1 1 1 1
0.387 0.401 0.403 0.413 0.422 0.464 0.467 0.61 0.611
1 1 1 1 1 1 1 1 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Most red wines have chlorides between 0.07 g/dm^3 and 0.09 g/dm^3 (median 0.079 g/dm^3, mean 0.08747 g/dm^3).
I removed outliers above 0.3 g/dm^3 and adjusted the bin width for a clearer plot.
There seem to be some outliers in the histogram of free.sulfur.dioxide.
6 5 10 15 12 7 9 16 17 11 13 8 14 3 18
138 104 79 78 75 71 62 61 60 59 57 56 50 49 46
4 21 19 24 23 26 20 27 25 28 29 22 32 31 34
41 41 39 34 32 32 30 29 24 23 23 22 22 20 18
30 35 33 36 38 41 40 39 48 51 1 37 42 43 45
16 15 11 11 9 7 6 5 4 4 3 3 3 3 3
52 37.5 50 55 68 2 5.5 40.5 46 47 53 54 57 66 72
3 2 2 2 2 1 1 1 1 1 1 1 1 1 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 7.00 14.00 15.87 21.00 72.00
Most free.sulfur.dioxide values are integers; the exceptions are 5.5, 37.5 and 40.5 (4 wines in total). Most values are between 7 mg/dm^3 and 21 mg/dm^3.
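The non-integer values can be picked out of the frequency table mechanically. The sketch below uses only the last row of the table above (the earlier rows contain integers only); `values`, `counts` and `is_fractional` are illustrative names.

```r
# Last row of the free.sulfur.dioxide frequency table: values and their counts.
values <- c(52, 37.5, 50, 55, 68, 2, 5.5, 40.5, 46, 47, 53, 54, 57, 66, 72)
counts <- c(3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

# A value is non-integer when it has a fractional part.
is_fractional <- values %% 1 != 0

values[is_fractional]        # distinct non-integer values
sum(counts[is_fractional])   # number of wines carrying them
```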
Removed outliers above 60 mg/dm^3.
total.sulfur.dioxide has 2 outliers above 170 mg/dm^3 (278 and 289).
28 24 15 18 23 14 20 31 38 27 12 19 13 10 17
43 36 35 35 34 33 33 32 31 30 29 29 28 27 27
25 11 16 35 37 42 21 22 26 47 44 48 49 29 32
27 26 26 26 26 26 25 25 24 24 23 21 21 20 20
34 45 54 43 60 33 40 46 65 39 52 8 9 30 41
20 20 20 18 18 17 17 17 17 16 15 14 14 14 14
53 58 88 55 63 36 67 50 51 56 64 68 72 86 59
14 14 14 13 13 12 12 11 11 10 10 10 10 10 9
61 62 66 85 89 69 70 74 77 92 94 71 73 91 98
9 9 9 9 9 8 8 8 8 8 8 7 7 7 7
119 57 81 84 87 99 102 106 110 75 79 90 96 104 105
7 6 6 6 6 6 6 6 6 5 5 5 5 5 5
7 78 80 82 95 101 109 113 121 6 76 100 108 111 112
4 4 4 4 4 4 4 4 4 3 3 3 3 3 3
122 124 129 131 133 141 144 145 147 77.5 83 93 103 114 115
3 3 3 3 3 3 3 3 3 2 2 2 2 2 2
120 125 127 128 134 135 136 143 148 151 116 126 130 139 140
2 2 2 2 2 2 2 2 2 2 1 1 1 1 1
142 149 152 153 155 160 165 278 289
1 1 1 1 1 1 1 1 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.00 22.00 38.00 46.47 62.00 289.00
Almost all total.sulfur.dioxide values are integers (the exception is 77.5, which occurs twice). Most red wines have total.sulfur.dioxide between 22 mg/dm^3 and 62 mg/dm^3; the median is 38 mg/dm^3.
I removed outliers for better visualization.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
Density appears to be roughly normally distributed, with most values between 0.995 and 1.0.
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.740 3.210 3.310 3.311 3.400 4.010
pH also appears to be roughly normally distributed. Most red wines have a pH between 3.21 and 3.4 (median 3.31, mean 3.311).
Sulphates has outliers above 1.5 g/dm^3 and a peak around 0.6 g/dm^3.
0.33 0.37 0.39 0.4 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52
1 2 6 4 5 8 16 12 18 19 29 31 27 26 47
0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67
51 68 50 60 55 68 51 69 45 61 48 46 41 42 36
0.68 0.69 0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82
35 23 33 26 28 26 26 20 25 26 23 18 19 15 22
0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97
15 13 14 13 13 7 7 8 8 5 10 4 2 3 6
0.98 0.99 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 1.1 1.11 1.12
2 3 1 1 3 2 2 3 4 2 3 1 2 1 1
1.13 1.14 1.15 1.16 1.17 1.18 1.2 1.22 1.26 1.28 1.31 1.33 1.34 1.36 1.56
2 2 1 1 5 3 1 1 1 2 1 1 1 3 1
1.59 1.61 1.62 1.95 1.98 2
1 1 1 2 1 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Median of sulphates is 0.62 g/dm^3.
I treated values above 1.4 g/dm^3 as outliers and left them out for better visualization.
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.40 9.50 10.20 10.42 11.10 14.90
Alcohol ranges from 8.4 to 14.9, with the main peak around 10. Most red wines have alcohol between 9.5 and 11.1 (median 10.2, mean 10.42).
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.000 5.000 6.000 5.636 6.000 8.000
5 6 7 4 8 3
681 638 199 53 18 10
All quality values are integers between 3 and 8. Most red wines have a quality of 5 or 6 (median 6, mean 5.636).
low medium high
10 1372 217
I created “quality_class” for simple categorical analysis. It has three levels of quality.
85.8% (1372 / 1599) of the wines are medium quality.
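One way such a three-level variable could be built is with cut(). The cut points below (low = 3, medium = 4 to 6, high = 7 to 8) are an assumption inferred from the counts above, since they reproduce 10 / 1372 / 217; the original code may have defined the classes differently.

```r
# Rebuild the quality scores from the frequency table printed above.
quality <- rep(c(5, 6, 7, 4, 8, 3), times = c(681, 638, 199, 53, 18, 10))

# Assumed cut points: (2,3] -> low, (3,6] -> medium, (6,8] -> high.
quality_class <- cut(quality,
                     breaks = c(2, 3, 6, 8),
                     labels = c("low", "medium", "high"))

class_counts <- table(quality_class)
class_counts

# Share of medium-quality wines, in percent.
round(100 * class_counts[["medium"]] / length(quality), 1)
```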
There are 1599 red wines with 13 variables: 11 input features and 2 output features (quality and quality_class). 12 variables come from the CSV file; I added 1 variable for the analysis.
The main feature of interest in the data set is quality. I’d like to find which chemical properties influence the quality of red wine. I suspect alcohol is highly related to quality, since red wine is a kind of liquor.
I think flavor or taste may be highly related to quality, so acid-related variables such as pH and citric.acid, along with residual.sugar, will help the investigation.
All of the given variables are numerical, which makes bivariate and multivariate analysis difficult. So I created the “quality_class” variable for bivariate and multivariate analysis of the “quality” feature.
Citric acid has three peaks, around 0, 0.25 and 0.5 g/dm^3. 132 red wines (about 8%) have no citric.acid.
‘quality’ is just a numerical value, so I converted it to a factor (6 levels, one per score from 3 to 8) for clearer histograms, and grouped it into the three-level ‘quality_class’ for further categorical analysis.
fixed.acidity volatile.acidity citric.acid
fixed.acidity 1.00000000 -0.256130895 0.67170343
volatile.acidity -0.25613089 1.000000000 -0.55249568
citric.acid 0.67170343 -0.552495685 1.00000000
residual.sugar 0.11477672 0.001917882 0.14357716
chlorides 0.09370519 0.061297772 0.20382291
free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
density 0.66804729 0.022026232 0.36494718
pH -0.68297819 0.234937294 -0.54190414
sulphates 0.18300566 -0.260986685 0.31277004
alcohol -0.06166827 -0.202288027 0.10990325
quality 0.12405165 -0.390557780 0.22637251
residual.sugar chlorides free.sulfur.dioxide
fixed.acidity 0.114776724 0.093705186 -0.153794193
volatile.acidity 0.001917882 0.061297772 -0.010503827
citric.acid 0.143577162 0.203822914 -0.060978129
residual.sugar 1.000000000 0.055609535 0.187048995
chlorides 0.055609535 1.000000000 0.005562147
free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
density 0.355283371 0.200632327 -0.021945831
pH -0.085652422 -0.265026131 0.070377499
sulphates 0.005527121 0.371260481 0.051657572
alcohol 0.042075437 -0.221140545 -0.069408354
quality 0.013731637 -0.128906560 -0.050656057
total.sulfur.dioxide density pH
fixed.acidity -0.11318144 0.66804729 -0.68297819
volatile.acidity 0.07647000 0.02202623 0.23493729
citric.acid 0.03553302 0.36494718 -0.54190414
residual.sugar 0.20302788 0.35528337 -0.08565242
chlorides 0.04740047 0.20063233 -0.26502613
free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
density 0.07126948 1.00000000 -0.34169933
pH -0.06649456 -0.34169933 1.00000000
sulphates 0.04294684 0.14850641 -0.19664760
alcohol -0.20565394 -0.49617977 0.20563251
quality -0.18510029 -0.17491923 -0.05773139
sulphates alcohol quality
fixed.acidity 0.183005664 -0.06166827 0.12405165
volatile.acidity -0.260986685 -0.20228803 -0.39055778
citric.acid 0.312770044 0.10990325 0.22637251
residual.sugar 0.005527121 0.04207544 0.01373164
chlorides 0.371260481 -0.22114054 -0.12890656
free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
density 0.148506412 -0.49617977 -0.17491923
pH -0.196647602 0.20563251 -0.05773139
sulphates 1.000000000 0.09359475 0.25139708
alcohol 0.093594750 1.00000000 0.47616632
quality 0.251397079 0.47616632 1.00000000
Alcohol (0.476) and sulphates (0.251) are the features most positively correlated with quality. volatile.acidity (-0.391) has the strongest negative correlation with quality.
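The ranking can be read off the quality column of the correlation matrix by sorting on absolute value. This sketch copies that column from the output above; `quality_cor` and `ranked` are names introduced here for illustration.

```r
# Correlation of each feature with quality, copied from the matrix above.
quality_cor <- c(
  fixed.acidity = 0.12405165, volatile.acidity = -0.39055778,
  citric.acid = 0.22637251, residual.sugar = 0.01373164,
  chlorides = -0.12890656, free.sulfur.dioxide = -0.05065606,
  total.sulfur.dioxide = -0.18510029, density = -0.17491923,
  pH = -0.05773139, sulphates = 0.25139708, alcohol = 0.47616632
)

# Sort by strength of correlation, ignoring the sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
round(head(ranked, 3), 3)
```

With the full data frame loaded, the same column would presumably come from something like cor() on the numeric columns; that step is omitted here because the data file is not part of this output.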
First, I will look at scatterplots of quality against the highly correlated variables: alcohol, sulphates and volatile.acidity.
Since the scatterplot is overplotted, I used jitter and alpha for a clearer picture.
I used the factor version of ‘quality’ for the boxplot. The boxplot shows that the median alcohol value tends to increase with quality.
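The medians that a boxplot draws can also be computed directly with tapply(). The sketch below uses only the first ten rows shown in the str() output at the top, so its numbers illustrate the mechanics and are not the medians of the full data set.

```r
# First ten alcohol and quality values, copied from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median alcohol within each quality level (what each box's centre line shows).
medians <- tapply(alcohol, factor(quality), median)
medians
```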
With a density plot, we can also see the positive relationship between alcohol and quality.
Call:
lm(formula = quality ~ alcohol, data = wineSubset)
Residuals:
Min 1Q Median 3Q Max
-2.8489 -0.4065 -0.1787 0.5176 2.5909
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.81782 0.17512 10.38 <2e-16 ***
alcohol 0.36646 0.01672 21.92 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7083 on 1596 degrees of freedom
Multiple R-squared: 0.2314, Adjusted R-squared: 0.2309
F-statistic: 480.4 on 1 and 1596 DF, p-value: < 2.2e-16
The linear model of alcohol and quality has an R-squared value of 0.2314. ‘wineSubset’ is the subset of the original data set without the alcohol outliers above the 99.9th percentile.
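A subset like ‘wineSubset’ could be produced with a quantile cutoff, as sketched below. This assumes the outliers were dropped with quantile(); the helper `drop_upper_tail` and the toy data are hypothetical stand-ins, not the original code or the real wine values.

```r
# Keep rows whose value in `column` is at or below the p-th quantile.
drop_upper_tail <- function(data, column, p) {
  cutoff <- quantile(data[[column]], probs = p, names = FALSE)
  data[data[[column]] <= cutoff, , drop = FALSE]
}

# Toy data standing in for the wine data frame (not the real values).
toy <- data.frame(alcohol = seq(8, 15, length.out = 1000))

toy_subset <- drop_upper_tail(toy, "alcohol", 0.999)
nrow(toy) - nrow(toy_subset)   # number of rows removed
```

On the 1599-row wine data a 99.9% cutoff would be expected to drop very few rows, which fits the 1596 residual degrees of freedom reported by the model above.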
Call:
lm(formula = quality ~ sulphates, data = wine_sulphates_Subset)
Residuals:
Min 1Q Median 3Q Max
-3.02595 -0.51097 -0.02595 0.47064 2.39707
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.44423 0.09018 49.28 <2e-16 ***
sulphates 1.83920 0.13573 13.55 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7653 on 1581 degrees of freedom
Multiple R-squared: 0.1041, Adjusted R-squared: 0.1035
F-statistic: 183.6 on 1 and 1581 DF, p-value: < 2.2e-16
‘sulphates’ has the second-highest positive correlation with quality. After removing outliers above the 99th percentile, the linear model has an R-squared value of 0.1041.
Call:
lm(formula = quality ~ volatile.acidity, data = wine_v_acidity_Subset)
Residuals:
Min 1Q Median 3Q Max
-2.78977 -0.54547 -0.01325 0.47198 2.92568
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.55757 0.05841 112.27 <2e-16 ***
volatile.acidity -1.74500 0.10503 -16.61 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7436 on 1596 degrees of freedom
Multiple R-squared: 0.1474, Adjusted R-squared: 0.1469
F-statistic: 276 on 1 and 1596 DF, p-value: < 2.2e-16
‘volatile.acidity’ has the strongest negative correlation with quality. Without outliers above the 99.9th percentile, the linear model has an R-squared value of 0.1474.
I’ll investigate the flavor-related variables and quality.
Call:
lm(formula = quality ~ residual.sugar, data = wine_r.sugar_subset)
Residuals:
Min 1Q Median 3Q Max
-2.6743 -0.6319 0.3560 0.3717 2.3778
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.60528 0.05277 106.220 <2e-16 ***
residual.sugar 0.01211 0.01993 0.608 0.544
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.809 on 1581 degrees of freedom
Multiple R-squared: 0.0002335, Adjusted R-squared: -0.0003989
F-statistic: 0.3692 on 1 and 1581 DF, p-value: 0.5435
There seems to be no relationship between residual.sugar and quality (R-squared 0.0002, p-value 0.54). Sweetness does not appear to affect the quality rating.
Call:
lm(formula = quality ~ chlorides, data = wine_chlorides_subset)
Residuals:
Min 1Q Median 3Q Max
-2.7136 -0.6505 0.2800 0.3653 2.3653
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.90638 0.05881 100.430 < 2e-16 ***
chlorides -3.15950 0.65812 -4.801 1.73e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.803 on 1581 degrees of freedom
Multiple R-squared: 0.01437, Adjusted R-squared: 0.01375
F-statistic: 23.05 on 1 and 1581 DF, p-value: 1.73e-06
‘chlorides’ reflects the amount of salt in the wine. Salty flavor is also barely related to quality: the linear model of chlorides and quality has an R-squared value of 0.01437.
Call:
lm(formula = quality ~ citric.acid, data = wine_c.acid_subset)
Residuals:
Min 1Q Median 3Q Max
-3.01809 -0.59820 0.09909 0.50922 2.59711
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.37360 0.03371 159.384 <2e-16 ***
citric.acid 0.97651 0.10144 9.627 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7847 on 1595 degrees of freedom
Multiple R-squared: 0.05491, Adjusted R-squared: 0.05432
F-statistic: 92.68 on 1 and 1595 DF, p-value: < 2.2e-16
The R-squared value of the linear model between citric.acid and quality is only 0.055, but that is much larger than for chlorides and residual.sugar.
Apart from the search for variables that determine quality, some variables are related to each other, such as the sulfur dioxide family and the acidity family.
Call:
lm(formula = total.sulfur.dioxide ~ free.sulfur.dioxide, data = wineQuality)
Residuals:
Min 1Q Median 3Q Max
-55.120 -13.534 -7.325 7.570 197.126
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.13535 1.11367 11.79 <2e-16 ***
free.sulfur.dioxide 2.09969 0.05858 35.84 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 24.5 on 1597 degrees of freedom
Multiple R-squared: 0.4458, Adjusted R-squared: 0.4454
F-statistic: 1285 on 1 and 1597 DF, p-value: < 2.2e-16
total.sulfur.dioxide is fairly strongly correlated with free.sulfur.dioxide. The R-squared value of its linear model is 0.4458.
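For a linear model with a single predictor, R-squared equals the squared Pearson correlation, so the values in the correlation matrix can be used to cross-check models fitted on the full data set. A small sketch using two correlations copied from the matrix above (`r_sulfur` and `r_citric` are illustrative names):

```r
# Pearson correlations copied from the correlation matrix.
r_sulfur <- 0.66766645    # free.sulfur.dioxide vs total.sulfur.dioxide
r_citric <- -0.54190414   # citric.acid vs pH

# Squaring gives the R-squared of the corresponding one-predictor model.
r_squared <- round(c(sulfur = r_sulfur^2, citric_pH = r_citric^2), 4)
r_squared
```

These agree with the Multiple R-squared reported for total.sulfur.dioxide ~ free.sulfur.dioxide (0.4458) and for pH ~ citric.acid (0.2937). Models fitted on outlier-trimmed subsets will differ slightly from the squared full-data correlation.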
Call:
lm(formula = pH ~ volatile.acidity, data = wine_v_acidity_Subset)
Residuals:
Min 1Q Median 3Q Max
-0.56954 -0.09889 0.00115 0.09310 0.65578
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.20394 0.01179 271.649 <2e-16 ***
volatile.acidity 0.20307 0.02121 9.575 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1502 on 1596 degrees of freedom
Multiple R-squared: 0.05432, Adjusted R-squared: 0.05373
F-statistic: 91.68 on 1 and 1596 DF, p-value: < 2.2e-16
The R-squared value is relatively low (0.05432), so volatile.acidity explains little of the variation in pH.
Call:
lm(formula = pH ~ fixed.acidity, data = wine_f.acid_subset)
Residuals:
Min 1Q Median 3Q Max
-0.51754 -0.06571 0.00170 0.06486 0.52156
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.81657 0.01385 275.64 <2e-16 ***
fixed.acidity -0.06076 0.00163 -37.27 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1128 on 1596 degrees of freedom
Multiple R-squared: 0.4654, Adjusted R-squared: 0.465
F-statistic: 1389 on 1 and 1596 DF, p-value: < 2.2e-16
Based on the R-squared value, fixed.acidity can explain about 46.5% of the variance in pH. Since the median fixed.acidity (7.9 g/dm^3) is far larger than the median volatile.acidity (0.52 g/dm^3), fixed.acidity is the main influence on pH.
Call:
lm(formula = pH ~ citric.acid, data = wineQuality)
Residuals:
Min 1Q Median 3Q Max
-0.50025 -0.07733 -0.00570 0.08251 0.58251
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.427491 0.005562 616.25 <2e-16 ***
citric.acid -0.429477 0.016668 -25.77 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1298 on 1597 degrees of freedom
Multiple R-squared: 0.2937, Adjusted R-squared: 0.2932
F-statistic: 664 on 1 and 1597 DF, p-value: < 2.2e-16
The R-squared value is 0.2937, which is larger than for the volatile.acidity model, even though the median citric.acid (0.26 g/dm^3) is smaller than the median volatile.acidity (0.52 g/dm^3).
Based on the correlation values, I investigated the relationship between quality and the other variables. Alcohol is the chemical property most strongly related to red wine quality.
From the viewpoint of flavor, no variable seems highly related to quality. In particular, salty (chlorides) and sweet (residual.sugar) tastes have almost no influence on the quality of wine.
I observed chemically similar variables. As expected, free.sulfur.dioxide and total.sulfur.dioxide are highly related (“total” includes “free”).
The relationship between the acid variables and pH is also interesting. As expected, higher acidity goes with lower pH, except for volatile.acidity: the linear model between volatile.acidity and pH has a low R-squared value of 0.05432. Fixed acidity is the dominant influence on pH, with an R-squared value of 0.4654, as expected from the median values of the acid variables.
The relationship between fixed.acidity and pH is the strongest. The correlation is -0.68 and the R-squared value of the linear model is 0.4654.
First, I’d like to investigate based on the correlations.
The plot shows a weak relationship between alcohol and volatile.acidity. However, it is hard to see the distribution for each quality level, so I used facet_wrap().
The lowest (3) and highest (8) quality wines are clearly distributed differently on the scatterplot.
Call:
lm(formula = quality ~ volatile.acidity/alcohol, data = wineQuality)
Residuals:
Min 1Q Median 3Q Max
-2.65872 -0.41597 -0.03492 0.47365 2.12949
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.43388 0.05333 120.65 <2e-16 ***
volatile.acidity -7.26558 0.32048 -22.67 <2e-16 ***
volatile.acidity:alcohol 0.55595 0.03092 17.98 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6784 on 1596 degrees of freedom
Multiple R-squared: 0.2953, Adjusted R-squared: 0.2944
F-statistic: 334.3 on 2 and 1596 DF, p-value: < 2.2e-16
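Note that in an R formula, volatile.acidity/alcohol is not a ratio: it expands to volatile.acidity + volatile.acidity:alcohol. In this model the volatile.acidity coefficient is negative (-7.27) and its interaction with alcohol is positive (0.556). The R-squared value is 0.2953.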
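The coefficient names in the output above (volatile.acidity and volatile.acidity:alcohol) come from how R's formula language treats the / operator. The sketch below inspects the term labels of two formulas without fitting anything; wrapping the expression in I() is the usual way to request an arithmetic ratio instead, which is not what the model above fitted.

```r
# Term labels R derives from each formula (no data needed).
nested <- attr(terms(quality ~ volatile.acidity/alcohol), "term.labels")
ratio  <- attr(terms(quality ~ I(volatile.acidity/alcohol)), "term.labels")

nested  # main effect plus an interaction term
ratio   # a single term: the arithmetic ratio
```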
Call:
lm(formula = quality ~ sulphates/alcohol, data = wineQuality)
Residuals:
Min 1Q Median 3Q Max
-2.7909 -0.3643 -0.1458 0.5097 2.4503
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.86396 0.06846 71.05 <2e-16 ***
sulphates -4.25244 0.26375 -16.12 <2e-16 ***
sulphates:alcohol 0.51926 0.02322 22.36 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6825 on 1596 degrees of freedom
Multiple R-squared: 0.2866, Adjusted R-squared: 0.2857
F-statistic: 320.6 on 2 and 1596 DF, p-value: < 2.2e-16
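The formula sulphates/alcohol expands to sulphates + sulphates:alcohol in R rather than a ratio. The sulphates coefficient is negative (-4.25), its interaction with alcohol is positive (0.519), and the R-squared value is 0.2866.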
Call:
lm(formula = quality ~ sulphates/volatile.acidity, data = wineQuality)
Residuals:
Min 1Q Median 3Q Max
-2.94529 -0.50157 -0.03746 0.48209 2.89781
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.13683 0.07624 67.37 <2e-16 ***
sulphates 1.99426 0.12124 16.45 <2e-16 ***
sulphates:volatile.acidity -2.39588 0.16352 -14.65 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7343 on 1596 degrees of freedom
Multiple R-squared: 0.1743, Adjusted R-squared: 0.1732
F-statistic: 168.4 on 2 and 1596 DF, p-value: < 2.2e-16
The formula sulphates/volatile.acidity expands to sulphates + sulphates:volatile.acidity in R rather than a ratio. The sulphates coefficient is positive (1.99), its interaction with volatile.acidity is negative (-2.40), and the R-squared value is 0.1743.
I’d like to continue from the bivariate plots by adding the quality variable. First, the sulfur dioxide family (free.sulfur.dioxide and total.sulfur.dioxide).
The sulfur dioxide variables are positively correlated. There is no significant difference in distribution between quality levels.
“pH” is related to acidity. I’ll check the relationship between pH and the acid-related variables: volatile.acidity, fixed.acidity and citric.acid.
pH is positively correlated with volatile.acidity, though weakly, and negatively correlated with fixed.acidity and citric.acid. There is no noticeable difference among quality classes.
Calls:
m1: lm(formula = quality ~ alcohol, data = wineSubset)
m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wineSubset)
m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates,
data = wineSubset)
===============================================
m1 m2 m3
-----------------------------------------------
(Intercept) 1.818*** 3.038*** 2.547***
(0.175) (0.185) (0.196)
alcohol 0.366*** 0.319*** 0.315***
(0.017) (0.016) (0.016)
volatile.acidity -1.384*** -1.221***
(0.095) (0.097)
sulphates 0.685***
(0.100)
-----------------------------------------------
R-squared 0.231 0.322 0.341
adj. R-squared 0.231 0.321 0.340
sigma 0.708 0.666 0.656
F 480.388 378.330 274.938
p 0.000 0.000 0.000
Log-likelihood -1715.379 -1615.409 -1592.411
Deviance 800.741 706.568 686.521
AIC 3436.757 3238.818 3194.823
BIC 3452.887 3260.324 3221.705
N 1598 1598 1598
===============================================
With the three variables most correlated with quality (alcohol, volatile.acidity and sulphates), I built a linear model for quality. Its R-squared value is 0.341.
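To make the m3 coefficients concrete, they can be applied by hand to a single wine. The sketch below uses the rounded coefficients from the table above and the first row of the str() output; `predict_quality_m3` is a hypothetical helper written for this example, and the rounding means it will not match predict() on the fitted model exactly.

```r
# Rounded m3 coefficients from the comparison table above.
predict_quality_m3 <- function(alcohol, volatile.acidity, sulphates) {
  2.547 + 0.315 * alcohol - 1.221 * volatile.acidity + 0.685 * sulphates
}

# First wine in the data set: alcohol 9.4, volatile.acidity 0.7, sulphates 0.56.
first_wine <- predict_quality_m3(alcohol = 9.4,
                                 volatile.acidity = 0.7,
                                 sulphates = 0.56)
round(first_wine, 2)   # the observed quality for this wine is 5
```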
The relationship between quality and alcohol is strengthened by adding other variables that are highly correlated with quality.
For the model of quality on volatile.acidity/alcohol, the R-squared value is 0.2953.
For the model of quality on sulphates/alcohol, the R-squared value is 0.2866.
Both are higher than the R-squared value (0.2314) of the linear model of quality on alcohol alone.
The relationship between volatile.acidity and alcohol shows an interesting result when using the categorical quality_class variable. The lowest (3) and highest (8) quality wines are distributed differently on the scatterplot.
The relationship among sulphates, alcohol and quality is interesting in the same way. The lowest (3) and highest (8) quality wines are distributed far apart.
The variable most correlated with quality is alcohol. The linear model of quality and alcohol has an R-squared value of 0.2314. To improve the R-squared value, I added volatile.acidity and sulphates. The linear model of quality with three variables has an R-squared value of 0.341.
Three flavors are considered for red wine: sweetness, saltiness and freshness. Based on the R-squared values, freshness (citric.acid) is the most important flavor for the quality rating, and sweetness is not related to the quality of red wine. In order of importance, the R-squared values are: freshness (citric.acid) 0.05491, saltiness (chlorides) 0.01437, sweetness (residual.sugar) 0.0002335.
Plotting volatile.acidity and sulphates against alcohol shows that the highest and lowest quality red wines have noticeably different distributions.
A low ratio of volatile.acidity to alcohol indicates high quality red wine, and so does a high ratio of sulphates to alcohol.
The data set contains 1599 red variants of the Portuguese “Vinho Verde” wine. I started by understanding the individual variables in the data set, and I was interested in the “alcohol” feature because wine is a kind of liquor.
Since the data set is tidy, I didn’t need to clean or filter it. However, all variables are numerical, which makes bivariate and multivariate plots hard to build, so I made a categorical variable, ‘quality_class’. Even with the categorical variable, it was hard to recognize differences in distribution by quality in a single plot, so I made a separate plot for each quality category or filtered for the most interesting qualities.
I presumed sweetness would be highly related to the quality of red wine. Surprisingly, the most important flavor is freshness from citric acid, while sweetness is the least important.
As I expected, the feature most correlated with quality is “alcohol”, and other features are related to quality as well. “sulphates” is also positively correlated with quality, and “volatile.acidity” is negatively correlated. The linear model with only the “alcohol” variable has an R-squared value of 0.231. Adding “volatile.acidity” and “sulphates” increases the R-squared value to 0.341.
Since the data set consists only of samples of the specific red wine mentioned above, this analysis is limited. It would be interesting to obtain data from various regions to reduce any bias that comes from studying a single product.